Code
import pandas as pd
kakamana
March 19, 2023
This chapter will introduce us to messy, incomplete data. In this lesson, we’ll learn how to find missing values in our data and explore multiple approaches to dealing with them. We’ll also use string manipulation to get rid of unwanted characters.
This Dealing with messy data is part of Datacamp course: Feature engineering for machine learning in Python
This is my learning experience of data science through DataCamp. These repository contributions are part of my learning journey through my graduate program masters of applied data sciences (MADS) at University Of Michigan, DeepLearning.AI, Coursera & DataCamp. You can find my similar articles & more stories at my medium & LinkedIn profile. I am available at kaggle & github blogs & github repos. Thank you for your motivation, support & valuable feedback.
These include projects, coursework & notebook which I learned through my data science journey. They are created for reproducible & future reference purpose only. All source code, slides or screenshot are intellactual property of respective content authors. If you find these contents beneficial, kindly consider learning subscription from DeepLearning.AI Subscription, Coursera, DataCamp
How gaps in data occur
* Data not being collected properly
* Collection and management errors
* Data intentionally being omitted
* Could be created due to transformations of the data
Why we care?
* Some models cannot work with missing data (Nulls/NaN)
* Missing data may be a sign a wider data issue
* Missing data can be a useful feature
How sparse is my data?
Most data sets contain missing values, often represented as NaN (Not a Number). If you are working with Pandas you can easily check how many missing values exist in each column.
SurveyDate | FormalEducation | ConvertedSalary | Hobby | Country | StackOverflowJobsRecommend | VersionControl | Age | Years Experience | Gender | RawSalary | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/28/18 20:20 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | Yes | South Africa | NaN | Git | 21 | 13 | Male | NaN |
1 | 6/28/18 13:26 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 70841.0 | Yes | Sweeden | 7.0 | Git;Subversion | 38 | 9 | Male | 70,841.00 |
2 | 6/6/18 3:37 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | No | Sweeden | 8.0 | Git | 45 | 11 | NaN | NaN |
3 | 5/9/18 1:06 | Some college/university study without earning ... | 21426.0 | Yes | Sweeden | NaN | Zip file back-ups | 46 | 12 | Male | 21,426.00 |
4 | 4/12/18 22:41 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 41671.0 | Yes | UK | 8.0 | Git | 39 | 7 | Male | £41,671.00 |
While having a summary of how much of your data is missing can be useful, often you will need to find the exact locations of these missing values
Age | Gender | |
---|---|---|
0 | 21 | Male |
1 | 38 | Male |
2 | 45 | NaN |
3 | 46 | Male |
4 | 39 | Male |
5 | 39 | Male |
6 | 34 | Male |
7 | 24 | Female |
8 | 23 | Male |
9 | 36 | NaN |
Age Gender
0 False False
1 False False
2 False True
3 False False
4 False False
5 False False
6 False False
7 False False
8 False False
9 False True
Issues with deletion
It deletes vaild data points
Relies on randomness
Reduces information
Listwise deletion
The simplest way to deal with missing values in your dataset is to remove them, also called ‘listwise deletion’.
Sometimes you’ll want to remove all missing values in your data, and sometimes you’ll just want to remove a column if too many values are missing in it.
(264, 11)
(999, 7)
In many cases, removing all missing data may be the right thing to do, but you’re omitting a lot of information.
It’s possible to find categorical columns with missing values that are valid information in themselves, like someone refusing to answer a survey question. Fill in all missing values with a new category entirely, like ‘No response given’.
Male 632
Female 53
Female;Male 2
Transgender 2
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming 1
Male;Non-binary. genderqueer. or gender non-conforming 1
Non-binary. genderqueer. or gender non-conforming 1
Female;Transgender 1
Name: Gender, dtype: int64
Male 632
Not Given 306
Female 53
Female;Male 2
Transgender 2
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming 1
Male;Non-binary. genderqueer. or gender non-conforming 1
Non-binary. genderqueer. or gender non-conforming 1
Female;Transgender 1
Name: Gender, dtype: int64
Deleting missing values
Can't delete rows with missing values in the test set
What else can you do?
Categorical columns: Replace missing values with the most common occurring value or with a string that flags missing values such as 'None'
Numerical columns: Replace missing values with a suitable value
Filling continuous missing values
In the last lesson, you dealt with different methods of removing data missing values and filling in missing values with a fixed string. These approaches are valid in many cases, particularly when dealing with categorical columns but have limited use when working with continuous values. In these cases, it may be most valid to fill the missing values in the column with a value calculated from the entries present in the column.
0 NaN
1 7.0
2 8.0
3 NaN
4 8.0
Name: StackOverflowJobsRecommend, dtype: float64
0 7.061602
1 7.000000
2 8.000000
3 7.061602
4 8.000000
Name: StackOverflowJobsRecommend, dtype: float64
so_survey_df[‘StackOverflowJobsRecommend’].fillna(so_survey_df[‘StackOverflowJobsRecommend’].mean(), inplace=True)
so_survey_df[‘StackOverflowJobsRecommend’] = round(so_survey_df[‘StackOverflowJobsRecommend’])
print(so_survey_df[‘StackOverflowJobsRecommend’].head())
Dealing with stray characters (I)
C:\Users\dghr201\AppData\Local\Temp\ipykernel_28724\3444689385.py:4: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '')
In the last exercise, you could tell quickly based off of the df.head() call which characters were causing an issue. In many cases this will not be so apparent. There will often be values deep within a column that are preventing you from casting a column as a numeric type so that it can be used in a model or further feature engineering.
One approach to finding these values is to force the column to the data type desired using pd.to_numeric(), coercing any values causing issues to NaN, Then filtering the DataFrame by just the rows containing the NaN values.
0 NaN
2 NaN
4 £41671.00
6 NaN
8 NaN
...
989 NaN
990 NaN
992 NaN
994 NaN
997 NaN
Name: RawSalary, Length: 401, dtype: object
0 NaN
1 70841.0
2 NaN
3 21426.0
4 41671.0
...
994 NaN
995 58746.0
996 55000.0
997 NaN
998 1000000.0
Name: RawSalary, Length: 999, dtype: float64
When applying multiple operations on the same column (like in the previous exercises), you made the changes in several steps, assigning the results back in each step. However, when applying multiple successive operations on the same column, you can “chain” these operations together for clarity and ease of management. This can be achieved by calling multiple methods sequentially:
0 NaN
1 70841.0
2 NaN
3 21426.0
4 41671.0
...
994 NaN
995 58746.0
996 55000.0
997 NaN
998 1000000.0
Name: RawSalary, Length: 999, dtype: float64
C:\Users\dghr201\AppData\Local\Temp\ipykernel_28724\2032499489.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
so_survey_df['RawSalary'] = so_survey_df['RawSalary']\